-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
traverser: fix default job duration when no duration specified #1104
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch and the fix is elegant.
I'm requesting a few minor changes: a slight change to the now
computation, a nit on the second commit message, and also that the commit order be swapped so the testsuite changes come after what they're testing.
resource/traversers/dfu_impl.hpp
Outdated
@@ -70,8 +70,9 @@ struct jobmeta_t { | |||
now = t; | |||
jobid = id; | |||
alloc_type = alloc; | |||
const auto now = std::chrono::system_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think redeclaring now
in this way will cause slight drift in the time when the updating traversal determines idata
availability:
int64_t avail = planner_multi_avail_resources_at (subplan, |
versus the at
time passed via meta
here:
flux-sched/resource/traversers/dfu.cpp
Line 325 in 8da547c
rc = detail::dfu_impl_t::update (root, writers, meta); |
The drift would probably be very small, but it could conceivably cause the avail
variable to be set incorrectly here:
if (avail == 0 && planner_multi_span_size (subplan) == 0) |
resource/traversers/dfu_impl.hpp
Outdated
int64_t g_duration = std::chrono::duration_cast<std::chrono::seconds> | ||
(graph_duration.graph_end - graph_duration.graph_start).count (); | ||
(graph_duration.graph_end - now).count (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of the above, my preference is to do something like this:
(graph_duration.graph_end - now).count (); | |
(graph_duration.graph_end - std::chrono::system_clock::now ()).count (); |
0336899
to
48b1444
Compare
Thanks @milroy! I've made the suggested changes and pushed the result. |
resource/traversers/dfu_impl.hpp
Outdated
(graph_duration.graph_end - | ||
std::chrono::system_clock::now()).count (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @grondo, but after more thought my previous suggestion results in a small drift, too. Can you do this instead:
(graph_duration.graph_end - | |
std::chrono::system_clock::now()).count (); | |
(graph_duration.graph_end - now); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do! However, for the uninitiated, can you describe the problem being fixed by this change? I only understand enough of std::chrono
to make this one change, and with this subtle of a change it seems this might be fairly brittle. At least I could add a comment here for future code spelunkers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation for my request was to reduce the number of independent measurements of when now is throughout Fluxion which I hope makes drift debugging easier.
Your question is good and caused me to reconsider the g_duration
computation and the downstream use. I think there's a bigger problem with the logic of aligning the job duration with the graph_end
time. Issue #1103 reveals that the check here (which was designed to catch the problem) isn't working correctly:
flux-sched/resource/traversers/dfu.cpp
Line 323 in bf62d8a
meta.duration = graph_end - *at; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only just realized what you said initially: that I was redeclaring now
in the original patch. Somehow I missed that there was already a now
declared in the function. 🤦 Sorry about that!
However, using the existing int64_t
now results in a compilation error. I'm guessing some kind of cast needs to be used, but being pretty unfamilar with std::chrono
I'm not sure what to use.
Since g_duration
and now
are already int64_t
type, would it make sense to just cast graph_duration.graph_end
to int64_t
as well before taking the difference, e.g.
diff --git a/resource/traversers/dfu_impl.hpp b/resource/traversers/dfu_impl.hpp
index 098fcd41..f9fa8f9b 100644
--- a/resource/traversers/dfu_impl.hpp
+++ b/resource/traversers/dfu_impl.hpp
@@ -70,9 +70,9 @@ struct jobmeta_t {
now = t;
jobid = id;
alloc_type = alloc;
- int64_t g_duration = std::chrono::duration_cast<std::chrono::seconds>
- (graph_duration.graph_end -
- std::chrono::system_clock::now()).count ();
+ int64_t g_end = std::chrono::system_clock::to_time_t
+ (graph_duration.graph_end);
+ int64_t g_duration = g_end - now;
if (g_duration <= 0) {
errno = EINVAL;
I really know nothing about std::chrono
and was honestly having trouble casting now
to the appropriate type otherwise. Please feel free to suggest anything you'd like (or even go ahead and push the correct fix to this branch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your question is good and caused me to reconsider the g_duration computation and the downstream use. I think there's a bigger problem with the logic of aligning the job duration with the graph_end time. Issue #1103 reveals that the check here (which was designed to catch the problem) isn't working correctly:
Ah, yes, I see now that the "fix" attempted in this PR won't really fix anything. If the start of the job is delayed, then its expiration may still be greater than the instance expiration because real time has passed between when build()
is called and the job's actual starttime.
I think the real fix is in the code you've called out above, or to have a way to just set the job expiration to the instance expiration, regardless of the job's duration (though I don't know the code well enough to determine how to do that).
Also, unless I'm mistaken, it seems like setting the job's duration to the instance duration in build()
isn't doing anything, and that code should be removed to avoid confusion.
Ideally, we'd like to get this and the follow on PR #1105 fixed before the tag tomorrow. Maybe I'll look into why the duration adjustment in run()
isn't working (as best I can), but could use help if you're available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, when I print some debug from that line, graph_end
seems to have the wrong scale:
at=1699291701 meta.duration=120 graph_end=1699291819000000000
Does graph_end
need one of those crazy std::chrono::duration_cast<std::chrono::seconds>
casts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@milroy: FYI I've dropped the initial "fix" here (which wasn't a fix at all as you pointed out) in favor of this:
diff --git a/resource/traversers/dfu.cpp b/resource/traversers/dfu.cpp
index 4f7e7c73..061b6844 100644
--- a/resource/traversers/dfu.cpp
+++ b/resource/traversers/dfu.cpp
@@ -290,7 +290,9 @@ int dfu_traverser_t::run (Jobspec::Jobspec &jobspec,
}
int rc = -1;
- int64_t graph_end = graph_duration.graph_end.time_since_epoch ().count ();
+ int64_t graph_end = std::chrono::duration_cast<std::chrono::seconds>
+ (graph_duration.graph_end
+ .time_since_epoch ()).count ();
detail::jobmeta_t meta;
vtx_t root = get_graph_db ()->metadata.roots.at (dom);
bool x = detail::dfu_impl_t::exclusivity (jobspec.resources, root);
I may have guessed wrong at the correct fix here, so feel free to correct me again.
I've also fixed up the test itself, which wasn't even testing with Fluxion modules in the previous version 🤦 Basically, this is a whole new PR...
I also noticed that the added test here does not work because So, I'll be force-pushing this branch with a completely different fix and the testsuite improved. |
Problem: In dfu_traverser_t::run() the current graph_end is compared to the current `*at` time to ensure the job duration won't overrun the graph expiration. However, the local graph_end variable is created using the default graph_duration.graph_end internal count, which may not be in seconds. Explicitly cast the graph_end to std::chrono::seconds so that the value of the local graph_end variable is the same units as `*at` and `duration` which are seconds. Fixes flux-framework#1103
Problem: The test in t4011-match-duration.t tests that the duration of a job is inherited from the enclosing instance duration, but we should really be testing that the *expiration* of a child job is inherited from the parent when no duration is specified. O/w, a job submitted when the instance has only a few minutes remaining could have its expiration set for long after the instance is terminated. Update the test to ensure that expiration of a job with no duration specified is inherited from the enclosing instance expiration. Use flux-alloc(1) to launch the test instance instead of running standalone flux-start(1) instances since this better simulates the intended use case.
Codecov Report
@@ Coverage Diff @@
## master #1104 +/- ##
======================================
Coverage 71.8% 71.8%
======================================
Files 88 88
Lines 11524 11525 +1
======================================
+ Hits 8279 8280 +1
Misses 3245 3245
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add my approval FWIW since we're on a tight timeline for a tag today, we need this fixed, and there's a regression test. Perhaps if @milroy thinks follow up is needed can do that in another PR?
Reimplemented the change here with a different fix
Thanks! I'll set MWP. |
When a resource graph has a fixed duration (i.e. a non-zero expiration), Fluxion currently assigns the graph duration to job requests that do not specify a duration. This results in job expirations that exceed the instance lifetime.
This PR fixes that by adjusting unspecified jobspec durations to the remaining graph duration, instead of the entire graph duration. This results in the expiration of these jobs matching that of the instance.
The existing test that checked this behavior was also updated and fixed.